Introduction to Open Data Science - Course Project

About the project

RStudio Exercise 1:

for course Introduction to Open Data Science 2021

Tasks and Instructions 1.(/5)

:white_check_mark:

Tasks and Instructions 2.(/5)

Feeling a bit excited about this course.

I heard about this course through the University of Eastern Finland's internal Yammer pages.

Link to Github repository here.

date()
## [1] "Mon Dec 06 03:14:22 2021"

RStudio Exercise 2

This week I have done quite a few DataCamp exercises. So many, in fact, that I went on to next week's exercises: DataCamp's grey-blue progress bar at the bottom showed five distinct chapters, and I did not realise that one bar corresponded to one week. This stressed me quite a bit, as there didn't seem to be an end to the DataCamp exercises.

Unfortunately I have not managed to start the recommended reading material this week, although I know I should.

I started the exercises too late.

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec 06 03:14:22 2021"

1. Read the data

Read the data file to an object. Display the object (data table).

# Read the data
learning2014 <- read.table(
  file = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
  header = TRUE,
  sep = ",")

# check the first few lines of the data table
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

Looking at the table, a note about the prior Data Wrangling part: attitude should have been scaled (divided by 10) as in the DataCamp exercises. The exercise instructions were unclear on this point in the Data Wrangling part.

Check data dimensionality:

dim(learning2014)
## [1] 166   7

Check data structure:

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The dataset is from a questionnaire study, the “international survey of Approaches to Learning”. Check here.

“gender” is the sex/gender of the questionnaire participant. Data type is character (chr).
“age” is the age of the participant in years (presumably). Data type is integer (int).
“attitude” is the “Global attitude toward statistics” divided by 10. Data type is numeric (num).
“deep” is the mean of the following original variables: “D03”, “D11”, “D19”, “D27”, “D07”, “D14”, “D22”, “D30”, “D06”, “D15”, “D23”, “D31”, summarizing the participant's deep learning. Data type is numeric (can contain decimals).
“stra” is the mean of the following original variables: “ST01”, “ST09”, “ST17”, “ST25”, “ST04”, “ST12”, “ST20”, “ST28”, summarizing the participant's strategic learning. Data type is numeric (can contain decimals).
“surf” is the mean of the following original variables: “SU02”, “SU10”, “SU18”, “SU26”, “SU05”, “SU13”, “SU21”, “SU29”, “SU08”, “SU16”, “SU24”, “SU32”, summarizing the participant's surface learning. Data type is numeric (can contain decimals).
“points” is the exam points. Data type is integer (int).
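The composite variables above could be computed roughly as follows. This is a minimal sketch based on the variable descriptions: the raw survey object name `lrn14` and the question column names are assumptions, not part of this analysis.

```r
# Question groups, assumed from the variable descriptions above
deep_questions <- c("D03","D11","D19","D27","D07","D14","D22","D30","D06","D15","D23","D31")
stra_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
surf_questions <- c("SU02","SU10","SU18","SU26","SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")

# rowMeans() averages each participant's answers over the selected questions
lrn14$deep <- rowMeans(lrn14[, deep_questions])
lrn14$stra <- rowMeans(lrn14[, stra_questions])
lrn14$surf <- rowMeans(lrn14[, surf_questions])
```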

2. Graphical overview, summaries

First we might need to install some packages to make some nice graphical overviews:

# uncomment if you need to install
#install.packages("GGally", "ggplot2")

Let’s start plotting some scatter plot matrix from our data:

# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

As you can see, the (multi)plot above is quite busy, with lots of information to take in.
Let's start off by noting the column names on top and the row names on the right side. There is no legend for gender, but the plot shows F and M: F is for female and M is for male. So this plot is colour coded by gender.

In the top left corner (gender column, gender row), we can see the counts of each value: F in pinkish colour, M in cyan. The female participant count is almost double the male count. Let's check this:

table(learning2014$gender)
## 
##   F   M 
## 110  56

In the top row (to the right of gender) we have boxplots for each gender. A boxplot describes the spread/distribution of the values: the box covers the interquartile range from Q1 (25th percentile) to Q3 (75th percentile), with a line for the median (Q2, 50th percentile) inside the box. Circles mark possible outliers (by default, points more than 1.5 times the interquartile range beyond the box).

On the left side (below gender), bar plots display the row variable's values (y-axis) against their counts (x-axis), by gender.

Scatter plots (plots with points) are column values (x-axis) plotted against row values (y-axis).

We also have pairwise correlation information, one variable against another. Most interesting of the correlations: points and attitude are positively correlated with each other, and deep learning and surface learning are negatively correlated with each other, but only among males. So the values of these variables are somehow intertwined or related to each other.
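The pairwise correlations can also be checked directly from the data; a quick illustrative check (not part of the original output):

```r
# Overall correlation between exam points and attitude
cor(learning2014$points, learning2014$attitude)

# Deep vs surface learning correlation within each gender
by(learning2014[, c("deep", "surf")], learning2014$gender,
   function(d) cor(d$deep, d$surf))
```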

Print summaries:

summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

For each numeric variable we get the minimum, maximum and mean; the median, 1st quartile (Q1) and 3rd quartile (Q3) are the same statistics shown in the boxplots (as explained above). By definition, 50% of the values lie between Q1 and Q3, regardless of the distribution.

3. Three variables as explanatory variables and fit a regression model

This is a multiple regression model, as there is more than one explanatory variable.

Let’s have deep, stra and surf as explanatory variables for target variable points.

# create multiple regression model
my_model <- lm(points ~ deep + stra + surf, data = learning2014)

# print out summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ deep + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## deep         -0.7443     0.8662  -0.859   0.3915    
## stra          0.9878     0.5962   1.657   0.0994 .  
## surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

Does not look too good. In multiple regression we are more interested in the adjusted R-squared value, which is quite low here, so the model does not really explain the variation in exam points. Let's change the model by switching one of the explanatory variables.

New model:

my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

4. i) Explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters) ii) Explain and interpret the multiple R squared of the model

i) Explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters)

A one-point increase in attitude is associated with a 3.395-point increase in exam points, adjusting for strategic and surface learning. 0.853 is the estimated effect of strategic learning on exam points, adjusting for attitude and surface learning. -0.586 is the estimated effect of surface learning on exam points, adjusting for attitude and strategic learning.
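The coefficient interpretation can be checked with a worked prediction for a hypothetical student (not an observation from the data):

```r
# Predicted exam points for attitude = 4, stra = 3, surf = 3,
# plugging the fitted coefficients into the regression equation
11.0171 + 3.3952 * 4 + 0.8531 * 3 - 0.5861 * 3  # about 25.4 points

# The same prediction via predict() on the fitted model
predict(my_model, newdata = data.frame(attitude = 4, stra = 3, surf = 3))
```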

ii) Explain and interpret the multiple R squared of the model

Multiple R-squared is 0.2074 and adjusted R-squared is 0.1927, so roughly 20% of the variation in exam points can be explained by attitude, strategic learning and surface learning. The adjusted R-squared is the more relevant measure here, as it penalizes for having multiple explanatory variables.

5. Plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

Let’s produce the plots: Residual vs Fitted values, Normal QQ-plot and Residuals vs Leverage

# create the model again just in case
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# draw diagnostic plots using the plot() function. Choose the plots Residual vs Fitted values, Normal QQ-plot and Residuals vs Leverage (which are 1,2 and 5)
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

Residuals vs Fitted: a reasonably random spread of points, so there shouldn't be a problem with the constant variance assumption.

Normal Q-Q: in the middle the points fit the normality line quite well, but at both ends they deviate from the line. So the normality assumption is questionable at the tails.

Residuals vs Leverage: this plot helps to identify observations that have unusually high impact on the model. There seem to be a few outlying values at the bottom of the plot; these outliers may strongly influence the model in question.
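A possible follow-up (not part of the original exercise) is to rank the observations by Cook's distance, which combines residual size and leverage into a single influence measure:

```r
# Observations with the largest influence on the fitted model
cooks <- cooks.distance(my_model)
head(sort(cooks, decreasing = TRUE))
```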


RStudio Exercise 3

2. Read data

alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", header = TRUE, sep = ",")

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

The original data set description can be found here. The data is a combination of two data sets on student performance in Mathematics and in Portuguese language: originally two different data sets with the same variables relating to student grades and demographic, social and school-related features.

Variables/column names with the suffix .p are originals from the por (Portuguese language) dataset and those with .m from the mat (Mathematics) dataset. In the combined dataset the variables “failures”, “paid” (first value selected), “absences”, “G1”, “G2” and “G3” are averaged over the mat and por datasets. “alc_use” is the average of “Dalc” - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) and “Walc” - weekend alcohol consumption (same scale). “high_use” is TRUE if “alc_use” is higher than 2 and FALSE otherwise.
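The derived alcohol variables can be sketched as follows (illustrative only; the combined dataset already contains them):

```r
# alc_use is the average of workday and weekend consumption
alc$alc_use <- (alc$Dalc + alc$Walc) / 2

# high_use flags students whose average consumption exceeds 2
alc$high_use <- alc$alc_use > 2
```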

3. Choosing 4 interesting variables to study relationships between high/low alcohol consumption

higher - wants to take higher education. I hypothesize that students not wanting to take higher education would have higher alcohol consumption.

famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I think family relationships might be an interesting variable that is highly correlated with alcohol consumption: students with bad family relationships would likely drink more, and vice versa.

age - my hypothesis is that older students will likely have higher alcohol consumption; the age range is from 15 to 22, and persons over 18 can buy alcohol legally.

failures - number of past class failures (numeric: n if 1<=n<3, else 4). I hypothesize that many failures would correlate with high alcohol consumption.

4. Exploring distributions of previously chosen variables

Check our variable types etc.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
my_vars <- c("high_use", "higher", "age", "famrel", "failures")
interest_alc <- subset(alc, select = my_vars)

glimpse(interest_alc)
## Rows: 370
## Columns: 5
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, F~
## $ higher   <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"~
## $ age      <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
## $ famrel   <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3, 4~
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~

Let's examine the count distributions of our selected variables “higher”, “age”, “famrel”, “failures” and “high_use”.

library(tidyr); library(ggplot2)

gather(interest_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

higher

Interesting: it looks like most students do want to take higher education. Let's check the actual counts and plot them.

alc %>% group_by(higher) %>% summarise(count = n())
## # A tibble: 2 x 2
##   higher count
##   <chr>  <int>
## 1 no        16
## 2 yes      354
alc %>% group_by(high_use, higher) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   high_use [2]
##   high_use higher count
##   <lgl>    <chr>  <int>
## 1 FALSE    no         7
## 2 FALSE    yes      252
## 3 TRUE     no         9
## 4 TRUE     yes      102
alc %>% group_by(high_use, higher) %>% summarise(count = n()) %>%
   ggplot(aes(x = higher, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

I am somewhat surprised that the majority of the students want to continue to higher education; overall, very few do not. It looks like my hypothesis does not hold.

famrel

Summarize counts of high_use by famrel in table and in barplot.

alc %>% group_by(high_use, famrel) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups:   high_use [2]
##    high_use famrel count
##    <lgl>     <int> <int>
##  1 FALSE         1     6
##  2 FALSE         2     9
##  3 FALSE         3    39
##  4 FALSE         4   128
##  5 FALSE         5    77
##  6 TRUE          1     2
##  7 TRUE          2     9
##  8 TRUE          3    25
##  9 TRUE          4    52
## 10 TRUE          5    23
alc %>% group_by(high_use, famrel) %>% summarise(count = n()) %>%
   ggplot(aes(x = famrel, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

famrel is the quality of family relationships (numeric: from 1 - very bad to 5 - excellent), so bad relationships get a low number and good ones a high number. My hypothesis looks rather poor here: the counts for high_use combined with bad family relationships are very low. Overall, bad family relationships seem to be rare among the students.

failures

Summarize counts of high_use by failures in table and in barplot.

alc %>% group_by(high_use, failures) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 8 x 3
## # Groups:   high_use [2]
##   high_use failures count
##   <lgl>       <int> <int>
## 1 FALSE           0   238
## 2 FALSE           1    12
## 3 FALSE           2     8
## 4 FALSE           3     1
## 5 TRUE            0    87
## 6 TRUE            1    12
## 7 TRUE            2     9
## 8 TRUE            3     3
alc %>% group_by(high_use, failures) %>% summarise(count = n()) %>%
   ggplot(aes(x = failures, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

I hypothesized that many failures would go together with high alcohol consumption. There does not seem to be much difference between the groups: overall the failure counts are very low, and the counts for one or more failures are very similar in the high-consumption and non-high-consumption groups.

age

alc %>% group_by(high_use) %>% summarise(mean = mean(age))
## # A tibble: 2 x 2
##   high_use  mean
##   <lgl>    <dbl>
## 1 FALSE     16.5
## 2 TRUE      16.8
# initialise a plot of high_use and age
g2 <- ggplot(alc, aes(x = high_use, y = age))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student age by alcohol consumption")

# let's also see the same with sex added
g3 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g3 + geom_boxplot() + ggtitle("Student age by alcohol consumption and sex")

alc %>% group_by(high_use, age) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups:   high_use [2]
##    high_use   age count
##    <lgl>    <int> <int>
##  1 FALSE       15    63
##  2 FALSE       16    74
##  3 FALSE       17    62
##  4 FALSE       18    52
##  5 FALSE       19     7
##  6 FALSE       20     1
##  7 TRUE        15    18
##  8 TRUE        16    28
##  9 TRUE        17    35
## 10 TRUE        18    25
## 11 TRUE        19     4
## 12 TRUE        22     1
alc %>% group_by(high_use, age) %>% summarise(count = n()) %>%
   ggplot(aes(x = age, y = count, fill = high_use)) +
   geom_col() + 
   facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.

The average age of high-use students is slightly higher, but high alcohol consumption does not seem to be clearly concentrated in the older students, in particular those aged 18 or over.

Overall, it looks like I chose variables that have a very weak or nonexistent relation to high alcohol consumption.

5. Logistic regression with chosen variables

We'll perform logistic regression to explore the relationships between the famrel, age, higher and failures variables and high/low alcohol consumption as the target variable.

log_reg_model <- glm(high_use ~ famrel + age + higher + failures, data = alc, family = "binomial")
summary(log_reg_model)
## 
## Call:
## glm(formula = high_use ~ famrel + age + higher + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6954  -0.8036  -0.7110   1.3402   1.8570  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.7777     1.9341  -0.919   0.3580   
## famrel       -0.2815     0.1254  -2.245   0.0248 * 
## age           0.1405     0.1029   1.365   0.1723   
## higheryes    -0.4499     0.5899  -0.763   0.4457   
## failures      0.5594     0.2102   2.661   0.0078 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 432.08  on 365  degrees of freedom
## AIC: 442.08
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR       2.5 %    97.5 %
## (Intercept) 0.1690283 0.003738272 7.4864947
## famrel      0.7546270 0.589068018 0.9650944
## age         1.1508478 0.941172724 1.4104114
## higheryes   0.6376962 0.198545179 2.0808162
## failures    1.7495838 1.162299822 2.6710151
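The odds ratios are simply the exponentiated coefficients from the summary above; for example, for failures:

```r
exp(0.5594)  # roughly 1.75, matching the failures row of the OR table
```

So each additional past class failure is associated with about 1.75 times the odds of high alcohol consumption, holding the other variables fixed.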

Something seems off: famrel was treated as numeric, so we only get one coefficient and cannot see the separate famrel levels. Let's treat famrel (and higher) as factors.

log_reg_model <- glm(high_use ~ factor(famrel) + age + factor(higher) + failures, data = alc, family = "binomial")
summary(log_reg_model)
## 
## Call:
## glm(formula = high_use ~ factor(famrel) + age + factor(higher) + 
##     failures, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7501  -0.8559  -0.7101   1.2656   1.8934  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        -3.1286     2.0803  -1.504  0.13260   
## factor(famrel)2     0.8850     0.9551   0.927  0.35411   
## factor(famrel)3     0.5687     0.8622   0.660  0.50949   
## factor(famrel)4     0.1298     0.8381   0.155  0.87693   
## factor(famrel)5    -0.2315     0.8574  -0.270  0.78714   
## age                 0.1445     0.1036   1.395  0.16310   
## factor(higher)yes  -0.4172     0.5945  -0.702  0.48281   
## failures            0.5515     0.2122   2.599  0.00934 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.86  on 362  degrees of freedom
## AIC: 445.86
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                           OR        2.5 %    97.5 %
## (Intercept)       0.04377755 0.0006880176  2.482282
## factor(famrel)2   2.42308808 0.4105189123 20.100563
## factor(famrel)3   1.76605639 0.3673243645 12.821709
## factor(famrel)4   1.13859072 0.2499432346  7.999589
## factor(famrel)5   0.79333156 0.1665015354  5.717950
## age               1.15541978 0.9437919758  1.417882
## factor(higher)yes 0.65888088 0.2035056764  2.170862
## failures          1.73582033 1.1481286518  2.658353

Let's try the model without an intercept.

log_reg_model <- glm(high_use ~ factor(famrel) + age + factor(higher) + failures -1, data = alc, family = "binomial")
summary(log_reg_model)
## 
## Call:
## glm(formula = high_use ~ factor(famrel) + age + factor(higher) + 
##     failures - 1, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7501  -0.8559  -0.7101   1.2656   1.8934  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## factor(famrel)1    -3.1286     2.0803  -1.504  0.13260   
## factor(famrel)2    -2.2436     1.9831  -1.131  0.25792   
## factor(famrel)3    -2.5599     1.9126  -1.338  0.18075   
## factor(famrel)4    -2.9988     1.9279  -1.556  0.11982   
## factor(famrel)5    -3.3601     1.9482  -1.725  0.08458 . 
## age                 0.1445     0.1036   1.395  0.16310   
## factor(higher)yes  -0.4172     0.5945  -0.702  0.48281   
## failures            0.5515     0.2122   2.599  0.00934 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 429.86  on 362  degrees of freedom
## AIC: 445.86
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp

# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                           OR        2.5 %   97.5 %
## factor(famrel)1   0.04377755 0.0006880176 2.482282
## factor(famrel)2   0.10607686 0.0021202344 5.145847
## factor(famrel)3   0.07731362 0.0017718458 3.260660
## factor(famrel)4   0.04984471 0.0011021640 2.152387
## factor(famrel)5   0.03473011 0.0007347747 1.554473
## age               1.15541978 0.9437919758 1.417882
## factor(higher)yes 0.65888088 0.2035056764 2.170862
## failures          1.73582033 1.1481286518 2.658353

RStudio Exercise 4

2. Loading data (Boston)

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# Structure and dimensions of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Original data description can be found here.

The Boston data frame contains housing values in suburbs of Boston; it has 506 rows and 14 columns.

crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.

3. Graphical overview and summaries of the data

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Plot matrix of variables

library(ggplot2)
library(GGally)

# remove chas variable which is categorical
Boston_cont <- subset(Boston, select = -chas)

ggpairs(Boston_cont,
        upper = list(continuous = wrap('cor', size = 4)),
        title = "Scatterplot matrix of `Boston`")

Very few variables seem to be truly normally distributed; rm (average number of rooms per dwelling) seems to be the closest.

Let’s take a more graphical look at the correlations of the variables:

library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

In the correlation matrix visualization above, red circles indicate negative correlations and blue circles positive ones; the larger the circle, the stronger the correlation.

A few picks from the correlation matrix:

crim “per capita crime rate by town” is most positively correlated with rad “index of accessibility to radial highways”.

indus “proportion of non-retail business acres per town” is negatively correlated with dis “weighted mean of distances to five Boston employment centres”. So towns with a larger proportion of non-retail business acres tend to lie farther from Boston’s five employment centres; in other words, industrial areas are situated away from the town centres.
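These picks can also be checked programmatically. A minimal sketch (using the MASS package that ships with R and provides the Boston data):

```r
# find the variable most positively correlated with per capita crime rate
library(MASS)  # provides the Boston dataset

crim_cor <- cor(Boston)["crim", ]
crim_cor <- crim_cor[names(crim_cor) != "crim"]  # drop crim itself

names(which.max(crim_cor))  # "rad"
round(max(crim_cor), 2)     # 0.63
```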

4. Standardize dataset

Standardize and scale the variables in the dataset:

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

After standardization the variables also take negative values, whereas before all values were positive. The values are now in “the same ballpark” for all variables: each has mean 0 and standard deviation 1.
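As a quick illustration of what scale() does (a minimal sketch on a toy vector, not the Boston data): it subtracts the column mean and divides by the standard deviation.

```r
x <- c(2, 4, 6, 8, 10)

# manual standardization: (x - mean) / sd
manual <- (x - mean(x)) / sd(x)

# scale() does the same column-wise
auto <- as.numeric(scale(x))

all.equal(manual, auto)  # TRUE
mean(auto)               # 0
sd(auto)                 # 1
```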

Here we create a categorical variable crime from the scaled crime rate, using quantiles as break points and naming the categories “low”, “med_low”, “med_high” and “high”, and we drop the original crime rate variable:

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Here we divide the dataset into a training set (80% of the data) and a test set (20% of the data):

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

5. Fitting the linear discriminant analysis on the train set

Here we fit the linear discriminant analysis (LDA) on the training set. The previously modified categorical crime rate (“low”, “med_low”, “med_high”, “high”) is used as target variable and everything else is used as predictors.

# linear discriminant analysis (lda() comes from the MASS package)
library(MASS)

lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2524752 0.2450495 0.2450495 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       1.0198499 -0.8949626 -0.158758869 -0.8719810  0.3832515 -0.8584477
## med_low  -0.1293002 -0.3009498 -0.002135914 -0.5832719 -0.1574325 -0.3845243
## med_high -0.3902253  0.2281429  0.284432582  0.3880084  0.1028506  0.4120176
## high     -0.4872402  1.0171737 -0.033716932  1.0829244 -0.4217105  0.7911348
##                 dis        rad        tax     ptratio      black       lstat
## low       0.9064216 -0.6881287 -0.7217461 -0.42759778  0.3736530 -0.74521427
## med_low   0.3777071 -0.5472552 -0.5084229 -0.09720132  0.3117241 -0.13715549
## med_high -0.3624983 -0.4076377 -0.2920244 -0.21134656  0.1130747  0.07511932
## high     -0.8500653  1.6375616  1.5136504  0.78011702 -0.7509322  0.85155087
##                  medv
## low       0.451007498
## med_low  -0.004100017
## med_high  0.144151846
## high     -0.661327872
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.10728735  0.87705473 -1.02839490
## indus    0.04684106 -0.22325857  0.26894082
## chas    -0.08998964 -0.10854007  0.06772517
## nox      0.35840423 -0.64306742 -1.37820644
## rm      -0.09004108 -0.20130968 -0.23542816
## age      0.27310473 -0.24235389 -0.28871459
## dis     -0.04306363 -0.43333992  0.16861521
## rad      3.11051361  1.16368749  0.14921051
## tax      0.06019964 -0.34316632  0.53648905
## ptratio  0.13563825  0.02285712 -0.42017257
## black   -0.15247626  0.01597731  0.08421988
## lstat    0.27440100 -0.42949498  0.25201583
## medv     0.26726678 -0.46436946 -0.21410814
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9437 0.0410 0.0153

Let’s draw the LDA (bi)plot:

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

It seems that high crime rates are best predicted by rad (index of accessibility to radial highways). zn (proportion of residential land zoned for lots over 25,000 sq.ft.) seems to best predict low crime rates. So a scarcity of small lots predicts a lower crime rate…

nox (nitrogen oxides concentration, parts per 10 million) seems to predict med_high to high crime rates as well.

6. Predicting classes with the LDA model on the test data

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      11        1    0
##   med_low    6       8       10    0
##   med_high   1       5       20    1
##   high       0       0        0   28

High crime rates especially seem to be predicted correctly by the model fitted on the training data. Low and med_low are also predicted correctly in most cases; the weakest prediction is for med_low. Note that the random sampling (where we chose 80% for training) might affect how well med_low and med_high are predicted, as they are not as “easy” as low and high. The more cohesive a class’s cluster is (for example in the previous plot), the easier it is to predict.
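The overall accuracy can be computed from the diagonal of the cross-tabulation. A sketch hard-coding the counts printed above (they will differ for another random split):

```r
# confusion matrix (rows = correct class, columns = predicted class)
conf <- matrix(c(11, 11,  1,  0,
                  6,  8, 10,  0,
                  1,  5, 20,  1,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy = sum of the diagonal / total observations
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.657

# per-class accuracy (recall): diagonal / row sums
round(diag(conf) / rowSums(conf), 2)
```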

7. Standardizing data for comparable distances… and k-means

Here we re-standardize the data so that distances between observations are comparable, and then calculate the Euclidean distances between the observations.

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
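As a sanity check on what dist() computes (a minimal sketch, not the Boston data): by default it returns Euclidean distances, i.e. the square root of the summed squared differences between rows.

```r
# two observations with three variables each
m <- rbind(c(1, 2, 3),
           c(4, 6, 3))

# dist() default is the Euclidean distance between rows
as.numeric(dist(m))             # 5

# the same by hand: sqrt((4-1)^2 + (6-2)^2 + (3-3)^2) = sqrt(25)
sqrt(sum((m[1, ] - m[2, ])^2))  # 5
```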

It is somewhat unclear whether we should explore the k-means algorithm with the scaled data… but here we go:


km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)


set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

With two clusters you can get pretty decent separation between most variable pairs.

Bonus

Super-bonus

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

RStudio Exercise 5

1. Read data, data summaries and graphical overview of variables

Read the data into the object human2, using the first column as row names.

human2 <- read.table(file = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", header = TRUE, sep = ",", row.names = 1)

# check dimensions (155, 8)
dim(human2)
## [1] 155   8

Let’s check the structure of the data first.

str(human2)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

All variables are numeric.

Let’s check summary.

summary(human2)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Edu2.FM was “Proportion of females with at least secondary education” divided by “Proportion of males with at least secondary education”. So a value under 1 means the proportion of females with at least secondary education is lower than that of males; a value over 1 means it is higher. It looks like females typically have a lower proportion of at least secondary education than males (per country).

Labo.FM was “Proportion of females in the labour force” divided by “Proportion of males in the labour force”. As with the previous variable, a value lower than 1 means the proportion of females working is lower than the proportion of males working. This seems to be the case overall.

Edu.Exp is the expected years of schooling per country. The minimum value (5.4 years) is strikingly low; the mean is 13.18 years.

Life.Exp is Life expectancy at birth.

GNI is Gross National Income per capita. There are really big differences here; let’s see more in the graphical overview.

Mat.Mor, the maternal mortality ratio, is deaths per 100,000 live births. The minimum is 1 death per 100,000 and the highest value is 1100, while the mean is 149.1. So the values are heavily skewed towards the lower end.

Ado.Birth is Adolescent birth rate is births per 1,000 women ages 15–19.

Parli.F is the percentage of female representatives in parliament. The minimum is 0, i.e. an all-male parliament. The highest value reaches only 57.5% and the mean is ~21%.

library(ggplot2)
library(GGally)

ggpairs(human2,
        upper = list(continuous = wrap('cor', size = 4)),
        title = "Scatterplot matrix of selected variables from Human Development Report 2015")
**Figure 1.** Scatterplot/correlations of human2 variables


Life.Exp is positively correlated with Edu.Exp. Higher life expectancy correlates with higher expected years in school.

GNI is positively correlated with Edu.Exp. Higher Gross National Income per capita correlates with higher expected years in school. GNI also positively correlates with Life.Exp. Higher gross national income per capita correlates with higher life expectancy.

Mat.Mor is negatively correlated with Edu2.FM: a higher maternal mortality ratio correlates with a lower female-to-male ratio of at least secondary education. A higher adolescent birth rate Ado.Birth also correlates with lower Edu2.FM.

A higher adolescent birth rate Ado.Birth correlates with fewer expected years in school (Edu.Exp), lower life expectancy (Life.Exp), lower Gross National Income per capita (GNI) and a higher maternal mortality ratio (Mat.Mor).

2. PCA, biplot with non-standardized data

Principal component analysis of the non-standardized data.

# perform pca
pca_human <- prcomp(human2)

# summary of pca_human
s <- summary(pca_human)


# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, 
       cex = c(0.8, 1), 
       col = c("grey40", "deeppink2"), 
       xlab = pc_lab[1], 
       ylab = pc_lab[2])
**Figure 2.** Biplot of PCA on non-standardized data. The first principal component explains 100% of the variance; the second, third, etc. principal components explain close to nothing (rounding to zero). This is typical when the variables are not measured on the same scale.


The first principal component explains 100% of the variance in the non-standardized data.
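The scale effect can be reproduced on toy data (a sketch with made-up variables, not the human2 data): one variable with a huge variance, like GNI here, swallows nearly all of PC1 unless the data are standardized first.

```r
set.seed(1)

# 'income' varies on a scale of thousands, 'ratio' between roughly 0 and 1
toy <- data.frame(income = rnorm(100, mean = 20000, sd = 15000),
                  ratio  = rnorm(100, mean = 0.85,  sd = 0.15))

# PCA on raw data: PC1 is essentially just 'income'
p_raw <- prcomp(toy)
summary(p_raw)$importance[2, "PC1"]   # close to 1 (proportion of variance)

# PCA on standardized data: the variance is shared between components
p_std <- prcomp(toy, scale. = TRUE)
summary(p_std)$importance[2, "PC1"]   # roughly 0.5
```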

3. PCA, biplot with standardized data

Principal component analysis of standardized data.

# standardize variables
human_std <- scale(human2)

# perform pca
pca_human <- prcomp(human_std)

# summary of pca_human
s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, 
       cex = c(0.8, 1), 
       col = c("grey40", "deeppink2"), 
       xlab = pc_lab[1], 
       ylab = pc_lab[2])
**Figure 3.** Biplot of PCA on standardized data. Unlike in Figure 2 (Section 2), the first principal component now explains 53.6% of the variance, the second 16.2%, PC3 9.6%, etc.


Unlike in Section 2, the first principal component no longer explains 100% of the variance, but 53.6%. The second principal component explains 16.2% of the variance, PC3 9.6%, etc.

Our variables are not measured on the same scale. By doing standardization we assign equal importance to all variables.

4. First two principal component dimensions (of standardized data)

On the first principal component axis, Edu.Exp, GNI, Edu2.FM and Life.Exp point (almost) in the same direction. The countries clustering in this direction along the first principal component (x-axis) have high educational expectancy, high GNI, high life expectancy and a high female-to-male ratio of at least secondary education. The vector lengths (arrows) describe the size of each variable’s effect along the axis.

Mat.Mor and Ado.Birth point in the opposite direction from the previously described group of variables. So high maternal mortality and a high adolescent birth rate correlate with each other, and the effects of these variables lie mostly along the first principal component axis (x-axis).

On the second principal component axis (y-axis), Parli.F (percentage of female parliament representatives) and Labo.FM (female-to-male ratio of labour force participation) point in almost the same direction, though both are slightly skewed along the x-axis.

There is quite a lot we can say from this picture. For example, the Nordic countries Iceland, Sweden, Norway, Finland and Denmark cluster in the upper left of the previous plot.
These countries tend to have one or more of the following high:
- educational expectancy
- life expectancy
- Gross National Income per capita
- female-to-male ratio of at least secondary education
- percentage of female parliament representatives
- female-to-male ratio of labour force participation

Qatar has one or more of the following high:
- educational expectancy
- life expectancy
- Gross National Income per capita
- female-to-male ratio of at least secondary education
…but quite low:
- percentage of female parliament representatives
- female-to-male ratio of labour force participation

Iran, Syria, Lebanon, Jordan and Yemen (Islamic countries) have among the lowest:
- percentage of female parliament representatives
- female-to-male ratio of labour force participation

Mozambique, Niger, Mali, Chad, the Central African Republic, Congo, Sierra Leone and other African countries have one or more of the following highest:
- maternal mortality
- adolescent birth rate
Mozambique also has high:

Not all variables contribute equally for every country; for some countries the effects of the variables can differ a lot.

5. FactoMineR, tea dataset, MCA

Let’s install FactoMineR if we do not already have it, load the FactoMineR library, load the tea data and check the data’s structure.

#install.packages("FactoMineR")
library(FactoMineR)

# load data
data(tea)

# check structure
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
?tea
## starting httpd help server ... done

The data used here concern a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their product perceptions are (12 questions) and some personal details (4 questions).

Okay, so apart from age, which is continuous (integer), every variable is categorical (a factor), most with two levels (two different options/values). There are 36 variables with 300 rows/observations.

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

library(dplyr)
library(ggplot2)
library(tidyr)
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Visualization of few (categorical) variables of tea dataset with level of the variable on x-axis and count on y-axis.

Most people drink Earl Grey tea made from tea bags bought from a chain store, without any additions (lemon, milk, other), and mostly not at lunch. Close to half drink their tea with sugar.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
#plot(mca, invisible=c("ind"), habillage = "quali")
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
**Figure 4.** MCA plot for tea dataset


With the MCA plot we can get some patterns between categorical variables.

For example, in the lower right corner unpackaged and tea shop cluster together: likely, people who drink unpackaged tea buy it from tea shops.

On the mid-low left side of the plot, tea bag and chain store cluster together: likely, people who drink their tea from tea bags buy them from chain stores most of the time.

Let’s run the MCA again on the whole tea dataset, dropping only the continuous age variable:

# multiple correspondence analysis

teaa <- select(tea, -age)
mca <- MCA(teaa, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = teaa, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.090   0.082   0.070   0.063   0.056   0.053   0.050
## % of var.              5.838   5.292   4.551   4.057   3.616   3.465   3.272
## Cumulative % of var.   5.838  11.130  15.681  19.738  23.354  26.819  30.091
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.048   0.047   0.044   0.041   0.040   0.039   0.037
## % of var.              3.090   3.053   2.834   2.643   2.623   2.531   2.388
## Cumulative % of var.  33.181  36.234  39.068  41.711  44.334  46.865  49.252
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.036   0.035   0.034   0.032   0.031   0.031   0.030
## % of var.              2.302   2.275   2.172   2.085   2.013   2.011   1.915
## Cumulative % of var.  51.554  53.829  56.000  58.086  60.099  62.110  64.025
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.028   0.027   0.026   0.025   0.025   0.024   0.024
## % of var.              1.847   1.740   1.686   1.638   1.609   1.571   1.524
## Cumulative % of var.  65.872  67.611  69.297  70.935  72.544  74.115  75.639
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.023   0.022   0.021   0.020   0.020   0.019   0.019
## % of var.              1.459   1.425   1.378   1.322   1.281   1.241   1.222
## Cumulative % of var.  77.099  78.523  79.901  81.223  82.504  83.745  84.967
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.018   0.017   0.017   0.016   0.015   0.015   0.014
## % of var.              1.152   1.092   1.072   1.019   0.993   0.950   0.924
## Cumulative % of var.  86.119  87.211  88.283  89.301  90.294  91.244  92.169
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               0.014   0.013   0.012   0.011   0.011   0.010   0.010
## % of var.              0.891   0.833   0.792   0.729   0.716   0.666   0.660
## Cumulative % of var.  93.060  93.893  94.684  95.414  96.130  96.796  97.456
##                       Dim.50  Dim.51  Dim.52  Dim.53  Dim.54
## Variance               0.009   0.009   0.008   0.007   0.006
## % of var.              0.605   0.584   0.519   0.447   0.390
## Cumulative % of var.  98.060  98.644  99.163  99.610 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.580  1.246  0.174 |  0.155  0.098  0.012 |  0.052  0.013
## 2             | -0.376  0.522  0.108 |  0.293  0.350  0.066 | -0.164  0.127
## 3             |  0.083  0.026  0.004 | -0.155  0.099  0.015 |  0.122  0.071
## 4             | -0.569  1.196  0.236 | -0.273  0.304  0.054 | -0.019  0.002
## 5             | -0.145  0.078  0.020 | -0.142  0.083  0.019 |  0.002  0.000
## 6             | -0.676  1.693  0.272 | -0.284  0.330  0.048 | -0.021  0.002
## 7             | -0.191  0.135  0.027 |  0.020  0.002  0.000 |  0.141  0.095
## 8             | -0.043  0.007  0.001 |  0.108  0.047  0.009 | -0.089  0.038
## 9             | -0.027  0.003  0.000 |  0.267  0.291  0.049 |  0.341  0.553
## 10            |  0.205  0.155  0.028 |  0.366  0.546  0.089 |  0.281  0.374
##                 cos2  
## 1              0.001 |
## 2              0.021 |
## 3              0.009 |
## 4              0.000 |
## 5              0.000 |
## 6              0.000 |
## 7              0.015 |
## 8              0.006 |
## 9              0.080 |
## 10             0.052 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## breakfast     |  0.182  0.504  0.031  3.022 |  0.020  0.007  0.000  0.330 |
## Not.breakfast | -0.168  0.465  0.031 -3.022 | -0.018  0.006  0.000 -0.330 |
## Not.tea time  | -0.556  4.286  0.240 -8.468 |  0.004  0.000  0.000  0.065 |
## tea time      |  0.431  3.322  0.240  8.468 | -0.003  0.000  0.000 -0.065 |
## evening       |  0.276  0.830  0.040  3.452 | -0.409  2.006  0.087 -5.109 |
## Not.evening   | -0.144  0.434  0.040 -3.452 |  0.214  1.049  0.087  5.109 |
## lunch         |  0.601  1.678  0.062  4.306 | -0.408  0.854  0.029 -2.924 |
## Not.lunch     | -0.103  0.288  0.062 -4.306 |  0.070  0.147  0.029  2.924 |
## dinner        | -1.105  2.709  0.092 -5.240 | -0.081  0.016  0.000 -0.386 |
## Not.dinner    |  0.083  0.204  0.092  5.240 |  0.006  0.001  0.000  0.386 |
##                Dim.3    ctr   cos2 v.test  
## breakfast     -0.107  0.225  0.011 -1.784 |
## Not.breakfast  0.099  0.208  0.011  1.784 |
## Not.tea time   0.062  0.069  0.003  0.950 |
## tea time      -0.048  0.054  0.003 -0.950 |
## evening        0.344  1.653  0.062  4.301 |
## Not.evening   -0.180  0.864  0.062 -4.301 |
## lunch          0.240  0.343  0.010  1.719 |
## Not.lunch     -0.041  0.059  0.010 -1.719 |
## dinner         0.796  1.805  0.048  3.777 |
## Not.dinner    -0.060  0.136  0.048 -3.777 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.031 0.000 0.011 |
## tea.time      | 0.240 0.000 0.003 |
## evening       | 0.040 0.087 0.062 |
## lunch         | 0.062 0.029 0.010 |
## dinner        | 0.092 0.000 0.048 |
## always        | 0.056 0.035 0.007 |
## home          | 0.016 0.002 0.030 |
## work          | 0.075 0.020 0.022 |
## tearoom       | 0.321 0.019 0.031 |
## friends       | 0.186 0.061 0.030 |
# visualize MCA
#plot(mca, invisible=c("ind"), habillage = "quali")
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
**Figure 5.** MCA plot for tea dataset with all categorical variables

#library(factoextra)
#fviz_mca_biplot(mca, 
#                repel = TRUE,
#                ggtheme = theme_minimal())

Using all variables except the continuous variable age, we get this kind of plot. Note that with more categorical variables, the first correspondence dimensions explain much less of the variance: only 5.84% for Dim 1 and 5.29% for Dim 2, compared to 15.24% and 14.23% in Figure 4.
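The percentages of variance quoted above can be recomputed from the eigenvalue table. A short sketch, assuming the `mca` object fitted above: in MCA the total inertia is the sum of all eigenvalues, so each dimension's share is its eigenvalue divided by that sum.

```r
# Recompute the "% of var." row of the summary from the eigenvalues
eigenvalues <- mca$eig[, 1]                    # first column: eigenvalue
pct_var <- 100 * eigenvalues / sum(eigenvalues)
round(pct_var[1:2], 3)   # compare with the 5.838 and 5.292 printed above

# cumulative percentage, as in the "Cumulative % of var." row
round(cumsum(pct_var)[1:2], 3)
```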


(More chapters will be added in the same way as the course proceeds!)